Changing Views on Curves and Surfaces
Visual events in computer vision are studied from the perspective of
algebraic geometry. Given a sufficiently general curve or surface in 3-space,
we consider the image or contour curve that arises by projecting from a
viewpoint. Qualitative changes in that curve occur when the viewpoint crosses
the visual event surface. We examine the components of this ruled surface, and
observe that these coincide with the iterated singular loci of the coisotropic
hypersurfaces associated with the original curve or surface. We derive
formulas, due to Salmon and Petitjean, for the degrees of these surfaces, and
show how to compute exact representations for all visual event surfaces using
algebraic methods.
Comment: 31 pages
Multigraded Cayley-Chow forms
We introduce a theory of multigraded Cayley-Chow forms associated to
subvarieties of products of projective spaces. Two new phenomena arise: first,
the construction turns out to require certain inequalities on the dimensions of
projections; and second, in positive characteristic the multigraded Cayley-Chow
forms can have higher multiplicities. The theory also provides a natural
framework for understanding multifocal tensors in computer vision.
Comment: 20 pages, 1 figure
Towards Visual Foundational Models of Physical Scenes
We describe a first step towards learning general-purpose visual
representations of physical scenes using only image prediction as a training
criterion. To do so, we first define "physical scene" and show that, even
though different agents may maintain different representations of the same
scene, the underlying physical scene that can be inferred is unique. Then, we
show that NeRFs cannot represent the physical scene, as they lack extrapolation
mechanisms. Those, however, could be provided by Diffusion Models, at least in
theory. To test this hypothesis empirically, NeRFs can be combined with
Diffusion Models, a process we refer to as NeRF Diffusion, used as unsupervised
representations of the physical scene. Our analysis is limited to visual data,
without external grounding mechanisms that can be provided by independent
sensory modalities.
Comment: TLDR: Physical scenes are equivalence classes of sufficient
statistics, and can be inferred uniquely by any agent measuring the same
finite data; we formalize and implement an approach to representation
learning that overturns "naive realism" in favor of an analytical approach of
Russell and Koenderink. NeRFs cannot capture the physical scenes, but
combined with Diffusion Models they can.
Prompt Algebra for Task Composition
We investigate whether prompts learned independently for different tasks can
be later combined through prompt algebra to obtain a model that supports
composition of tasks. We consider Visual Language Models (VLM) with prompt
tuning as our base classifier and formally define the notion of prompt algebra.
We propose constrained prompt tuning to improve performance of the composite
classifier. In the proposed scheme, prompts are constrained to appear in the
lower dimensional subspace spanned by the basis vectors of the pre-trained
vocabulary. Further regularization is added to ensure that the learned prompt
is grounded correctly to the existing pre-trained vocabulary. We demonstrate
the effectiveness of our method on object classification and object-attribute
classification datasets. On average, our composite model obtains classification
accuracy within 2.5% of the best base model. On UTZappos it improves
classification accuracy over the best base model by 8.45% on average.
Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
We investigate compositional structures in data embeddings from pre-trained
vision-language models (VLMs). Traditionally, compositionality has been
associated with algebraic operations on embeddings of words from a pre-existing
vocabulary. In contrast, we seek to approximate representations from an encoder
as combinations of a smaller set of vectors in the embedding space. These
vectors can be seen as "ideal words" for generating concepts directly within
the embedding space of the model. We first present a framework for
understanding compositional structures from a geometric perspective. We then
explain what these compositional structures entail probabilistically in the
case of VLM embeddings, providing intuitions for why they arise in practice.
Finally, we empirically explore these structures in CLIP's embeddings and we
evaluate their usefulness for solving different vision-language tasks such as
classification, debiasing, and retrieval. Our results show that simple linear
algebraic operations on embedding vectors can be used as compositional and
interpretable methods for regulating the behavior of VLMs.
Comment: 18 pages, 9 figures, 7 tables
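Illustrative sketch (not taken from the paper): the idea of approximating embeddings as combinations of a smaller set of vectors can be tried on synthetic data. The code below assumes concepts indexed by two factors (for example an attribute and an object) and fits an additive decomposition by simple averaging. The function names, the two-factor layout, and the averaging rule are assumptions made for this example, and the random arrays stand in for real VLM embeddings.

```python
import numpy as np

def ideal_word_decomposition(E):
    """Fit E[i, j] ~ u0 + u_first[i] + u_second[j] by averaging.

    E has shape (n1, n2, d): one embedding per pair of factor values.
    """
    u0 = E.mean(axis=(0, 1))           # overall mean embedding
    u_first = E.mean(axis=1) - u0      # (n1, d): one vector per value of factor 1
    u_second = E.mean(axis=0) - u0     # (n2, d): one vector per value of factor 2
    return u0, u_first, u_second

def compose(u0, u_first, u_second):
    """Rebuild all (n1, n2) embeddings from the additive parts."""
    return u0 + u_first[:, None, :] + u_second[None, :, :]

# Synthetic stand-in for encoder outputs (assumption: no real model is used).
rng = np.random.default_rng(0)
n1, n2, d = 3, 4, 8
base = rng.normal(size=d)
p = rng.normal(size=(n1, d))
q = rng.normal(size=(n2, d))

# Exactly additive embeddings are recovered up to floating-point error.
E_additive = base + p[:, None, :] + q[None, :, :]
u0, u_first, u_second = ideal_word_decomposition(E_additive)
error_additive = np.abs(compose(u0, u_first, u_second) - E_additive).max()

# With noise added, the additive model is only an approximation.
E_noisy = E_additive + 0.1 * rng.normal(size=E_additive.shape)
error_noisy = np.abs(compose(*ideal_word_decomposition(E_noisy)) - E_noisy).max()
```

Here n1 + n2 + 1 vectors stand in for n1 * n2 embeddings, which is the sense in which a smaller set of vectors generates the larger one. How well this holds for a real encoder is an empirical question that the sketch does not address.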
Trinocular Geometry Revisited
When do the visual rays associated with triplets of point correspondences converge, that is, intersect in a common point? Classical models of trinocular geometry based on the fundamental matrices and trifocal tensor associated with the corresponding cameras only provide partial answers to this fundamental question, in large part because of underlying, but seldom explicit, general configuration assumptions. This paper uses elementary tools from projective line geometry to provide necessary and sufficient geometric and analytical conditions for convergence in terms of transversals to triplets of visual rays, without any such assumptions. In turn, this yields a novel and simple minimal parameterization of trinocular geometry for cameras with non-collinear or collinear pinholes, which can be used to construct a practical and efficient method for trinocular geometry parameter estimation. We present numerical experiments using synthetic and real data.
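Illustrative sketch (not the paper's method): the question of whether three visual rays meet in a common point can be checked numerically with a generic least-squares test, rather than with the transversal-based conditions the abstract describes. The code below finds the point closest to three lines and reports whether all three pass within a tolerance of it. The function names and the tolerance are assumptions for this example, and the rays are treated as full lines, so the direction in which they point is ignored.

```python
import numpy as np

def closest_point_to_lines(origins, directions):
    """Least-squares point minimizing the summed squared distance to the lines."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector onto the plane orthogonal to d
        A += P
        b += P @ o
    # lstsq also copes with the degenerate case where A is singular
    # (for example, all lines parallel).
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

def max_distance_to_lines(x, origins, directions):
    """Largest orthogonal distance from x to any of the lines."""
    dists = []
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        r = (x - o) - np.dot(x - o, d) * d
        dists.append(np.linalg.norm(r))
    return max(dists)

def rays_converge(origins, directions, tol=1e-9):
    """Return (converges, point): whether all lines pass within tol of one point."""
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    x = closest_point_to_lines(origins, directions)
    return max_distance_to_lines(x, origins, directions) < tol, x

# Three pinholes (assumed positions) looking at one scene point.
origins = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
target = np.array([0.3, 0.4, 5.0])
directions = target - origins
ok, x = rays_converge(origins, directions)

# Perturbing one ray breaks the convergence.
perturbed = directions.copy()
perturbed[2] += np.array([0.1, 0.0, 0.0])
ok_perturbed, _ = rays_converge(origins, perturbed)
```

With noisy real data the tolerance would need to reflect measurement error. This check also gives no insight into degenerate configurations, which is the gap the paper's analysis targets.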
The joint image handbook
Given multiple perspective photographs, point correspondences form the "joint image", effectively a replica of three-dimensional space distributed across its two-dimensional projections. This set can be characterized by multilinear equations over image coordinates, such as epipolar and trifocal constraints. We revisit in this paper the geometric and algebraic properties of the joint image, and address fundamental questions such as how many and which multilinearities are necessary and/or sufficient to determine camera geometry and/or image correspondences. The new theoretical results in this paper answer these questions in a very general setting and, in turn, are intended to serve as a "handbook" reference about multilinearities for practitioners.
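Illustrative sketch (not taken from the paper): the simplest multilinear equation of this kind is the bilinear epipolar constraint x2^T F x1 = 0 between two views. The code below builds a fundamental matrix from two projection matrices using the standard textbook construction F = [e2]_x P2 P1^+, then evaluates the constraint for a true correspondence and for a mismatched one. The camera matrices, the scene point, and the function names are made up for this example.

```python
import numpy as np

def skew(v):
    """3x3 matrix such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(P1, P2):
    """Fundamental matrix for two 3x4 cameras, defined up to scale and sign."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                 # homogeneous center of camera 1 (null vector of P1)
    e2 = P2 @ C1                # epipole: image of that center in camera 2
    return skew(e2) @ P2 @ np.linalg.pinv(P1)

# Two hypothetical cameras: identity pose and a unit shift along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
F = fundamental_from_cameras(P1, P2)

# A scene point in homogeneous coordinates and its two projections.
X = np.array([0.2, -0.1, 4.0, 1.0])
x1, x2 = P1 @ X, P2 @ X
residual_match = abs(x2 @ F @ x1)          # vanishes for a true correspondence

# Moving the second image point off its epipolar line violates the constraint.
x2_wrong = x2 + np.array([0.0, 0.5, 0.0])
residual_mismatch = abs(x2_wrong @ F @ x1)
```

Trifocal and higher-order constraints follow the same pattern with more views, which is where the question of how many such equations are needed becomes nontrivial.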